Exploratory Data Analysis

The purpose of this document is to get familiar with the stock market data.

We will be using the Alpha Vantage API for several reasons:

  • real-time data that is refreshed every time we work with the data
  • accurate data for a wide variety of stocks
  • built-in functions for different types of analysis

This is the data returned by the API. To start, let's get to know what each column means.

Open is the price of the stock when the market opened on the given day.
High is the highest price the stock reached during the given day.
Low is the opposite of High: the lowest price of the stock on that day.
Close is the price of the stock when the market closed on the given day.
Adjusted Close is the closing price adjusted for corporate actions such as splits and dividends; 'Close' in this data is the raw, unadjusted value.
Volume is the number of shares traded on the given day.
Dividend Amount refers to a reward, cash or otherwise, that a company gives to its shareholders.
Split Coefficient records a stock split, which is when a company divides its existing shares into multiple new shares to boost the stock's liquidity.
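As a small sketch of how these columns might be tidied up before analysis: the raw labels below are an assumption about how the daily adjusted series is labelled in the API response, and the single row of values is made up purely for illustration.

```python
import pandas as pd

# Assumed raw column labels for the daily adjusted series (not confirmed by this document).
RAW_TO_CLEAN = {
    "1. open": "open",
    "2. high": "high",
    "3. low": "low",
    "4. close": "close",
    "5. adjusted close": "adjusted_close",
    "6. volume": "volume",
    "7. dividend amount": "dividend_amount",
    "8. split coefficient": "split_coefficient",
}

# One made-up row standing in for the API response (not real quotes).
raw = pd.DataFrame(
    [[38.0, 39.0, 37.5, 38.2, 38.2, 1_000_000, 0.0, 1.0]],
    columns=list(RAW_TO_CLEAN),
    index=pd.to_datetime(["2020-01-02"]),
)

# Rename to short snake_case names that are easier to type.
df = raw.rename(columns=RAW_TO_CLEAN)
print(df.columns.tolist())
```

Short names like these are only a convenience; the rest of the analysis works equally well with the original labels.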

For this project I will use the Facebook stock because it:

  • is one of the most famous stocks on the market
  • has a long history of records (since 2012)

Let's begin by checking for any missing data.

Since the check returns no rows, we can assume that there are no missing values in our dataframe.
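One way such a check could look in pandas is sketched below. The helper name `rows_with_missing` and the two toy frames are hypothetical and only illustrate the idea of listing the rows that contain a missing value.

```python
import numpy as np
import pandas as pd


def rows_with_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows that contain at least one missing value."""
    return df[df.isna().any(axis=1)]


# Toy data: one complete frame and one with a single gap.
complete = pd.DataFrame({"close": [10.0, 11.0], "volume": [100, 120]})
gappy = pd.DataFrame({"close": [10.0, np.nan], "volume": [100, 120]})

print(len(rows_with_missing(complete)))  # number of flagged rows in the complete frame
print(len(rows_with_missing(gappy)))     # number of flagged rows in the frame with a gap
print(gappy.isna().sum())                # per-column count of missing values
```

Note that a check like this only catches empty cells. Trading days that are absent from the index altogether would need a separate check on the dates.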


The close value of the Facebook stock can be seen above. However, I tested the API on several other stocks, and for some of them the Close values are sometimes incorrect. Initially the data I used only had these closing values, but after reading the documentation I found that this 'Close' data is raw; to get the actual closing value I need a different API call, which returns a different dataset.

Now we can see the accurate closing prices for Facebook since the stock became public in mid-2012. At first glance there appears to be no difference, but there are in fact small differences. Let's plot both series.

The lines overlap, which means that the Close value in this data is fairly accurate even though it is raw. This is a good sign, because it suggests that the other raw values such as High, Low and Open are also accurate.
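Besides comparing the two lines by eye, the gap between them can be measured directly. The following is a minimal sketch on made-up numbers; the column names `close` and `adjusted_close` are assumptions about how the dataframe is labelled.

```python
import pandas as pd

# Made-up prices: the two series agree except for one small deviation.
df = pd.DataFrame(
    {
        "close": [100.0, 102.0, 101.0, 105.0],
        "adjusted_close": [100.0, 102.0, 100.9, 105.0],
    }
)

# Absolute and relative distance between raw and adjusted close.
abs_diff = (df["close"] - df["adjusted_close"]).abs()
rel_diff = abs_diff / df["adjusted_close"]

print(abs_diff.max())        # largest absolute gap
print(rel_diff.max())        # largest gap relative to the adjusted price
print((abs_diff > 0).sum())  # number of days on which the two values differ
```

A small maximum gap backs up what the overlapping lines suggest, while a large one would point at splits or dividends in the history.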


Let's use some of the API's functions to extract the moving averages of the Facebook stock.

The Simple Moving Average (SMA) is the mean of the data over a given period. In this case the SMA's period is one month. SMAs are part of the technical analysis used to predict the future value of a stock. Let's plot it on the same chart as the price movement.
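The notebook takes the SMA from the API, but the calculation itself is easy to reproduce. The sketch below uses a pandas rolling mean instead of the API call; the prices and the window length of 3 are hypothetical.

```python
import pandas as pd

# Toy closing prices.
close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Hypothetical window: number of observations averaged at each step.
window = 3

# Each value is the mean of the current price and the (window - 1) prices before it.
# The first (window - 1) entries are NaN because there is not enough history yet.
sma = close.rolling(window=window).mean()
print(sma.tolist())
```

The leading NaN values are the reason an SMA line always starts a little later than the price line on a chart.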

Such graphs can reveal very interesting patterns. For example, a change in the direction of a trend can be indicated by the price crossing the SMA. Generally, a buy signal is generated when the price breaks above the moving average, and a sell signal when the price breaks below it. It is added confirmation when the moving average line turns in the direction of the price trend. When there is a large increase or decrease in a short period of time, the SMA does not catch up immediately, because it averages over a month of values. For this reason the monthly SMA is a good choice for long-term predictions but less suitable for short-term ones.
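The crossover rule described above can be sketched in a few lines. This is only an illustration on made-up prices with a hypothetical 3-observation window, not a trading strategy: a buy is flagged on the first day the price is above the SMA after being at or below it, and a sell on the opposite transition.

```python
import pandas as pd

# Made-up closing prices and a short hypothetical window.
close = pd.Series([10, 11, 12, 11, 9, 8, 9, 11, 13, 12], dtype=float)
sma = close.rolling(window=3).mean()

# True where the price is above its moving average (False while the SMA is still NaN).
above = close > sma
prev_above = above.shift(1, fill_value=False)

# Only compare days where both today's and yesterday's SMA exist.
valid = sma.notna() & sma.shift(1).notna()

buy = valid & above & ~prev_above    # price breaks above the SMA
sell = valid & ~above & prev_above   # price falls back to or below the SMA

print(close[buy])
print(close[sell])
```

The `valid` mask keeps the warm-up period of the SMA from producing a spurious signal on the first day the average becomes available.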

The weekly SMA follows the daily stock prices of Facebook more closely. For this reason it can be the better choice for predictions, as it is more sensitive to changes in trend.
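The difference in sensitivity can be made concrete by measuring how far each average sits from the price. The sketch below uses a synthetic series and assumes windows of 5 and 21 observations as rough stand-ins for a trading week and a trading month; the helper `mean_tracking_gap` is hypothetical.

```python
import numpy as np
import pandas as pd

t = np.arange(120)
# Synthetic price: steady uptrend with a sudden 10-point jump at t = 60.
close = pd.Series(100 + 0.5 * t + 10.0 * (t >= 60))


def mean_tracking_gap(prices: pd.Series, window: int) -> float:
    """Average absolute distance between the price and its SMA (NaN warm-up rows are skipped)."""
    sma = prices.rolling(window=window).mean()
    return float((prices - sma).abs().mean())


gap_short = mean_tracking_gap(close, 5)   # assumed stand-in for a weekly window
gap_long = mean_tracking_gap(close, 21)   # assumed stand-in for a monthly window

print(gap_short, gap_long)
```

On this series the shorter window stays closer to the price, which is the behaviour the weekly SMA shows in the chart. The price of that responsiveness is that a short window also reacts to noise.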


This OHLC chart (for open, high, low and close) is a style of financial chart describing the open, high, low and close values for a given date. The tips of the vertical lines represent the low and high values, and the horizontal lines represent the open and close values. When the stock rises on a given day the lines are green, and when it falls they are red. The bar below the graph can be used to filter the dates and observe a specific time period if needed.
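The colouring rule of the chart can be reproduced on the raw numbers. The sketch below uses made-up values, and treating a flat day (close equal to open) as red is an arbitrary choice made here, not something the chart documents.

```python
import pandas as pd

# Made-up OHLC rows.
ohlc = pd.DataFrame(
    {
        "open": [10.0, 12.0, 11.0],
        "high": [12.5, 12.2, 11.8],
        "low": [9.8, 10.9, 10.5],
        "close": [12.0, 11.0, 11.5],
    }
)

# Green when the day closed above its open, red otherwise (flat days count as red here).
ohlc["color"] = ["green" if c > o else "red" for o, c in zip(ohlc["open"], ohlc["close"])]

# Sanity check: the high and low must bracket both the open and the close.
consistent = (ohlc["high"] >= ohlc[["open", "close"]].max(axis=1)) & (
    ohlc["low"] <= ohlc[["open", "close"]].min(axis=1)
)

print(ohlc)
print(consistent.all())
```

The bracket check is a cheap way to spot rows where the API returned inconsistent values before they reach a chart.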


Heatmaps are very useful for finding relations between pairs of variables in a dataset, giving the user a visualisation of the numeric data. No correlations are extremely high. Each square shows the correlation between the variables on each axis. Let's visualize a heatmap of all features in the data set.

The closer the correlation is to 1, the more positively correlated the two variables are: as one increases, so does the other. The correlation between Open and Close is noticeably high, which means the two move closely together.
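The numbers behind a heatmap like this are a plain correlation matrix. The sketch below builds a synthetic dataframe (the prices, the volume pattern and the rule that each day opens at the previous close are all assumptions made for illustration) and computes the matrix with pandas.

```python
import numpy as np
import pandas as pd

t = np.arange(200)

# Synthetic close: uptrend plus a slow wave.
close = pd.Series(100 + 0.5 * t + 5 * np.sin(t / 5))

# Simplification: each day opens at the previous day's close.
open_ = close.shift(1).fillna(close.iloc[0])

df = pd.DataFrame(
    {
        "open": open_,
        "high": np.maximum(open_, close) + 1.0,
        "low": np.minimum(open_, close) - 1.0,
        "close": close,
        "volume": 1_000_000 + 100_000 * np.cos(1.7 * t),  # unrelated to the price by construction
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

A matrix like `corr` is what a plotting function such as seaborn's `heatmap` would colour in; the plotting step is left out here to keep the sketch free of extra dependencies.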


Correlation ranges from -1 to +1. Values close to zero mean there is no linear trend between the two variables, values close to +1 indicate a strong positive relationship, and values close to -1 indicate a strong negative one. In this case most of the values correlate almost perfectly with each other, because the closing price of a stock usually moves together with the other price features such as Low, High and Open.
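The three reference points of the scale can be checked on toy data. This is a minimal sketch with NumPy; the arrays are made up so that each case is exact.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect positive linear relationship: both rise together.
r_up = np.corrcoef(x, 2 * x + 1)[0, 1]

# Perfect negative linear relationship: one rises as the other falls.
r_down = np.corrcoef(x, -3 * x + 7)[0, 1]

# Symmetric U shape: the variables are clearly related, but not linearly.
r_flat = np.corrcoef(x, (x - 3) ** 2)[0, 1]

print(r_up, r_down, r_flat)
```

The last case is worth keeping in mind when reading the heatmap: a correlation near zero rules out a linear trend, not every kind of relationship.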


Both the heatmap and the correlation diagram show that there is no strong relation between the volume of a stock and its price. However, my research shows that volume is a big part of technical analysis when a prediction is made for a specific stock. This is why we are going to display the volume of the Facebook stock now.

Even though the volume does not seem so important to the closing value in the heatmap and the correlation diagram, it does have a large impact on the price, because it represents the number of shares bought or sold during the day. Let's display both values together.

As expected, the graph shows that a drastic change in volume coincides with a big move in the price, which either increases or decreases. This supports the view that Volume is an important feature when predicting the price.
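Because the price can move in either direction on a volume spike, it can help to compare volume with the size of the daily move rather than with the price level. The sketch below does this on made-up numbers that were constructed to contain two spikes, so it illustrates the calculation and says nothing about the real Facebook data.

```python
import pandas as pd

# Made-up series: volume spikes on days 3 and 7 coincide with a jump up and a drop.
close = pd.Series([100, 101, 100, 110, 109, 110, 111, 95, 96, 95], dtype=float)
volume = pd.Series([1, 1, 1, 5, 1, 1, 1, 6, 1, 1], dtype=float)  # in millions of shares

# Size of the daily move, ignoring its direction.
abs_return = close.pct_change().abs()

# Correlation of volume with the price level versus with the size of the move.
corr_level = volume.corr(close)
corr_move = volume.corr(abs_return)

print(corr_level, corr_move)
```

Taking the absolute value is the key step: up and down moves would otherwise cancel out, which is one possible reason volume looks unimportant in a correlation against the price itself.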

It appears that Facebook has never paid dividends to its shareholders. In that case it doesn't make sense to include this column in the model for Facebook, but since other companies do pay dividends it would not be wise to drop the column so quickly.

Just like with dividends, Facebook has not split its shares. Further research is needed to see what potential impact these two values have on the price of a stock.

Post research:

  • If a split occurs, the stock's price is affected. After a split the price per share is reduced, since the number of shares outstanding has increased. For example, in a 2-for-1 split a company with 10 million shares outstanding will have 20 million afterwards, and the price of each share will halve.
  • Before a dividend is distributed, the issuing company must first declare the dividend amount and the date when it will be paid. The declaration of a dividend naturally encourages investors to purchase the stock, which leads to an increase in the stock's value.
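The arithmetic of the 2-for-1 split example can be written out as follows. The pre-split price of 100 is a hypothetical figure chosen only to make the numbers easy to read.

```python
# Figures from the example: 10 million shares outstanding, 2-for-1 split.
shares_before = 10_000_000
split_ratio = 2

# Hypothetical pre-split price per share.
price_before = 100.0

# The share count is multiplied by the ratio and the price is divided by it.
shares_after = shares_before * split_ratio
price_after = price_before / split_ratio

# The total value of the company is unchanged by the split itself.
market_cap_before = shares_before * price_before
market_cap_after = shares_after * price_after

print(shares_after, price_after)
print(market_cap_before == market_cap_after)
```

This is why a split shows up as a sudden drop in the raw Close series while the company's total value stays the same.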

Conclusion

Some very interesting patterns were found in the dataset, including ones that the heatmap and the correlation diagram could not reveal; I was able to find those thanks to the domain research. The next step is data preparation. Thankfully the API returns very good data, and so far I have not experienced problems with missing values in the dataset. It may still be worth combining this dataset with other data to see whether I can find further relations that could help make the prediction as accurate as possible.